
ROCm: Add gfx950 (MI355X/CDNA4) to is_cdna() and include PR #4021 fixes #4050

Closed
GoldenGrapeGentleman wants to merge 18 commits into unslothai:main from GoldenGrapeGentleman:fix/warn-unsupported-lora-targets

Conversation

@GoldenGrapeGentleman
Contributor

Summary

Add AMD Instinct MI355X (gfx950 / CDNA4) support to is_cdna() and include ROCm stability fixes from PR #4021 by @danielhanchen.

Problem

is_cdna() only listed gfx940/941/942 (MI300 series). MI355X (gfx950, CDNA4) has the same 1024-thread workgroup limit but was missing, causing all Triton kernels to use num_warps=32 (2048 threads) instead of 16 (1024 threads):

triton.runtime.errors.OutOfResources: out of resource: threads, Required: 2048, Hardware limit: 1024

This blocked all training on MI355X.
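The shape of the fix can be sketched as follows. This is a hedged, simplified illustration of the behavior described above, not the exact code in unsloth/kernels/utils.py (which queries the architecture from the device at runtime); the helper names here are illustrative.

```python
# CDNA architectures with a 1024-thread workgroup limit. Per the PR, the
# list previously stopped at gfx942; gfx950 (MI350/MI355X, CDNA4) is the addition.
CDNA_ARCHS = ("gfx940", "gfx941", "gfx942", "gfx950")

def is_cdna(arch_name: str) -> bool:
    """True if the reported gcnArchName is a CDNA3/CDNA4 part."""
    return any(arch in arch_name for arch in CDNA_ARCHS)

def pick_num_warps(arch_name: str) -> int:
    # AMD wavefronts are 64 threads wide: 16 warps * 64 = 1024 threads fits
    # the workgroup limit, while 32 warps * 64 = 2048 triggers the
    # OutOfResources error shown above.
    return 16 if is_cdna(arch_name) else 32
```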

Changes

Verified on 8× AMD Instinct MI355X (gfx950), ROCm 7.1

Test | Result
Vision RL GRPO (Qwen2.5-VL-7B) | ✅ 5/5 steps, 117s
Code RL GRPO (gpt-oss-20b BF16) | ✅ 20/20 steps, 470s
gpt-oss-120b GRPO (8-GPU) | ✅ 5/5 steps, 328s
MoE expert LoRA + merge | ✅ 46.2M trainable, merge success

cc @danielhanchen — this includes your PR #4021 changes, cherry-picked and validated on MI355X. The is_cdna() fix is the additional piece needed for CDNA4.

danielhanchen and others added 18 commits February 10, 2026 14:36
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942),
but was missing from is_cdna(), causing all Triton kernels to use num_warps=32
(2048 threads) instead of 16 (1024 threads), resulting in OutOfResources crash.

Also includes ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021
by @danielhanchen, cherry-picked for MI355X validation.

Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
@gemini-code-assist
Contributor

Summary of Changes

Hello @GoldenGrapeGentleman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances ROCm compatibility and stability by extending support to AMD Instinct MI355X (gfx950/CDNA4) GPUs, which directly addresses Triton kernel thread limit issues. It also integrates a suite of stability fixes from a previous pull request, focusing on robust model loading for GPT-OSS on HIP devices, refining dequantization logic, and proactively mitigating potential AITER-related problems on ROCm.

Highlights

  • ROCm GFX950 (MI355X/CDNA4) Support: Added support for AMD Instinct MI355X (gfx950/CDNA4) to the is_cdna() check, resolving critical Triton kernel thread limit errors that previously blocked training on these devices.
  • Integrated PR ROCm: default GPT-OSS to BF16 and disable AITER #4021 Stability Fixes: Incorporated several stability improvements from PR ROCm: default GPT-OSS to BF16 and disable AITER #4021, including enhanced routing for GPT-OSS models on ROCm, a fix for dequantization buffer dtype handling, and default disabling of AITER to prevent JIT build locks and runtime faults on ROCm.


Changelog
  • unsloth/__init__.py
    • Wrapped torch.cuda.is_bf16_supported() and torch.xpu.is_bf16_supported() in local is_bf16_supported functions for hip and xpu devices.
  • unsloth/device_type.py
    • Imported the os module.
    • Added default environment variable settings to disable AITER (AITER_DISABLE=1) and USE_ROCM_AITER_ROPE_BACKEND=0 for HIP devices to prevent JIT build locks and runtime faults.
  • unsloth/kernels/utils.py
    • Extended the is_cdna() function to include "gfx950" (CDNA4 MI350/MI355X) in the list of CDNA architectures.
    • Modified fast_dequantize to reallocate WEIGHT_BUFFER if its dtype changes, in addition to checking if it's None.
  • unsloth/models/_utils.py
    • Enhanced patch_gradient_accumulation_fix to include try-except blocks and checks for the generated _unsloth___init__ function, improving robustness during patching.
    • Updated _prepare_model_for_qat to attempt initializing Int4WeightOnlyConfig with version=2 and included a TypeError fallback for older TorchAO versions.
  • unsloth/models/loader.py
    • Introduced a new internal function _route_hip_gpt_oss_model to manage model loading for GPT-OSS on HIP devices, specifically routing to BF16 models for Instinct/MI GPUs if pre-quantized 4-bit models are not suitable.
    • Integrated the _route_hip_gpt_oss_model call into the from_pretrained function at two different points.
  • unsloth/save.py
    • Added a condition to skip GGUF conversion and return early if the UNSLOTH_GGUF_OFFLINE environment variable is set to "1".
    • Included an early return with default values if all_file_locations is empty, preventing index errors after a skipped or failed GGUF conversion.
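The AITER defaults described in the changelog entry for unsloth/device_type.py can be sketched like this. This is a minimal illustration assuming the real code applies the defaults only on HIP; the wrapper function name is hypothetical, but the two environment variables are the ones named above.

```python
import os

def apply_rocm_aiter_defaults(device_type: str) -> None:
    """Set opt-out AITER defaults on HIP to avoid JIT build locks and runtime faults."""
    if device_type != "hip":
        return
    # setdefault preserves any value the user has already exported, so these
    # act as defaults rather than hard overrides.
    os.environ.setdefault("AITER_DISABLE", "1")
    os.environ.setdefault("USE_ROCM_AITER_ROPE_BACKEND", "0")
```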
Activity
  • The author has validated the changes on 8x AMD Instinct MI355X (gfx950) with ROCm 7.1, providing detailed test results for various models and tasks.
  • The pull request incorporates cherry-picked changes from PR ROCm: default GPT-OSS to BF16 and disable AITER #4021 by @danielhanchen, indicating collaboration and integration of prior work.


@gemini-code-assist bot left a comment
Code Review

This pull request introduces support for the AMD Instinct MI355X (gfx950) GPU and incorporates several stability fixes for ROCm, which is a valuable enhancement. The changes are logical and well-implemented, particularly the safety improvements around dynamic code execution and buffer handling. I've identified a couple of areas with code duplication that could be refactored to improve long-term maintainability. Overall, this is a solid contribution that improves hardware support and the robustness of the library.

Comment on lines 207 to +218

    elif DEVICE_TYPE == "hip":
        SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16
    elif DEVICE_TYPE == "xpu":
        # torch.xpu.is_bf16_supported() does not have including_emulation
        # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
        SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16
Severity: medium

To improve maintainability and reduce code duplication, you can refactor this logic. The is_bf16_supported function is defined identically for both hip and xpu device types. Combining the elif blocks for hip and xpu and defining the function only once would make the code cleaner.

Suggested change

Before:

    elif DEVICE_TYPE == "hip":
        SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16
    elif DEVICE_TYPE == "xpu":
        # torch.xpu.is_bf16_supported() does not have including_emulation
        # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
        SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16

After:

    elif DEVICE_TYPE in ("hip", "xpu"):
        if DEVICE_TYPE == "hip":
            SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()
        else:  # xpu
            # torch.xpu.is_bf16_supported() does not have including_emulation
            # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
            SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16

Comment on lines +278 to +294

    (
        model_name,
        load_in_4bit,
        load_in_8bit,
        load_in_fp8,
        load_in_16bit,
        quantization_config,
    ) = _route_hip_gpt_oss_model(
        model_name = model_name,
        use_exact_model_name = use_exact_model_name,
        load_in_4bit = load_in_4bit,
        load_in_8bit = load_in_8bit,
        load_in_fp8 = load_in_fp8,
        load_in_16bit = load_in_16bit,
        quantization_config = quantization_config,
        kwargs = kwargs,
    )
Severity: medium

This block of code for routing HIP GPT-OSS models is duplicated in FastModel.from_pretrained at lines 880-896. To improve maintainability and reduce redundancy, consider refactoring this logic into a shared helper method that both FastLanguageModel.from_pretrained and FastModel.from_pretrained can call. This would centralize the model routing logic, making future updates easier.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41c5a9639f


Comment on lines +1473 to +1477

    if not lower_model_name.endswith("-bf16"):
        if "120b" in lower_model_name:
            model_name = "unsloth/gpt-oss-120b-BF16"
        else:
            model_name = "unsloth/gpt-oss-20b-BF16"


P1: Restrict HIP GPT-OSS remap to canonical model IDs

_route_hip_gpt_oss_model rewrites matched names to unsloth/gpt-oss-20b-BF16/120b-BF16 based only on substring matching, so on HIP it can replace requested non-base models (e.g. unsloth/gpt-oss-safeguard-20b, which is a valid mapped ID in unsloth/models/mapper.py:1246-1252) and local checkpoint paths that include gpt-oss. In those cases the loader silently fetches different weights than the caller asked for, which can invalidate training/evaluation results.
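A fix along the lines Codex suggests would remap only exact, canonical model IDs instead of any name containing "gpt-oss". This is a hedged sketch under that assumption: the function name is illustrative, and the mapping table is an example (the real routing in unsloth/models/loader.py may cover more IDs).

```python
# Exact-match remap table: only canonical base IDs are rewritten.
_HIP_GPT_OSS_REMAP = {
    "unsloth/gpt-oss-20b":  "unsloth/gpt-oss-20b-BF16",
    "unsloth/gpt-oss-120b": "unsloth/gpt-oss-120b-BF16",
}

def route_hip_gpt_oss_name(model_name: str) -> str:
    # Exact lookup means variants like "unsloth/gpt-oss-safeguard-20b" and
    # local checkpoint paths that merely contain "gpt-oss" pass through
    # untouched, so the loader never silently swaps in different weights.
    return _HIP_GPT_OSS_REMAP.get(model_name.lower(), model_name)
```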

